
feat: add inlining options and a new inline command #374

Merged: 1 commit merged into main from mikix/inline on Feb 4, 2025

Conversation

Contributor

@mikix mikix commented Jan 16, 2025

Inlining here means downloading attachment URLs and inserting them as attachment data, allowing you to download each attachment just once, at the cost of extra storage space.

This is often useful in preparation for NLP tasks, where you want to refer back to DocumentReference clinical notes repeatedly, but do not want to constantly deal with the flakiness of an EHR server (or losing access to that server).
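
To make that concrete, here is a rough sketch of what inlining does to a DocumentReference attachment. The resource below is a made-up example for illustration, not actual output of this PR:

```python
import base64

# Before inlining: the attachment only points back at the EHR server.
doc_ref = {
    "resourceType": "DocumentReference",
    "id": "example",
    "content": [{"attachment": {"contentType": "text/plain", "url": "Binary/note-123"}}],
}

# After inlining: the note text is embedded as base64 data, so later NLP runs
# never need to talk to the server (at the cost of a larger NDJSON file).
note_text = b"Patient seen today for follow-up..."
doc_ref["content"][0]["attachment"]["data"] = base64.standard_b64encode(note_text).decode("ascii")
```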

New features:

  • There is a new inline command that can inline an existing folder of NDJSON.
  • The export and etl commands both now accept an opt-in flag to inline data after performing a bulk export.
  • If the server provides a Retry-After header that is a timestamp (rather than a number of seconds), we now parse that correctly (a sketch of handling both forms follows this list).
  • All server requests are now retried a bit on errors. (Previously, only bulk export requests were retried; now every request to the server is, including Medication downloads and DocRef attachment downloads during NLP tasks.)
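
As a rough illustration of the Retry-After handling, both header forms can be turned into a number of seconds like so (a hypothetical helper, not the PR's actual code):

```python
import datetime
import email.utils

def parse_retry_after(value: str | None) -> float | None:
    """Hypothetical: return a delay in seconds from a Retry-After header, or None if unusable."""
    if not value:
        return None
    try:
        return max(float(value), 0)  # simple "Retry-After: 120" form
    except ValueError:
        pass
    try:
        # timestamp form, e.g. "Retry-After: Fri, 31 Jan 2025 09:00:00 GMT"
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=datetime.timezone.utc)
    now = datetime.datetime.now(datetime.timezone.utc)
    return max((when - now).total_seconds(), 0)
```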

Behavior changes:

  • If the server gives us a 429 error, we now log it as an error message rather than as a successful bulk export progress call.
  • If the server gives us a 429 error, we use the next exponential backoff delay instead of hard coding 60 seconds as the delay.
  • If the server gives us a Retry-After header on an error message, we no longer unconditionally accept and use it. Rather, the requested delay is capped by our next exponential backoff delay. That is, the server's Retry-After time will be used if it's LESS than our next backoff, but if it's longer, we'll still use our own backoff. This is lightly hostile, but (a) it's only done on error cases, (b) our backoff times are generous, and counted in minutes not seconds, and (c) it lets us guarantee a max delay time for callers. (A sketch of this capping rule follows this list.)
  • Instead of retrying on 429 and ALL 5xx errors, there's a specific list of error codes that we retry on. Currently it's 408, 429, 500, 502, 503, and 504.
  • Have the bulk exporter time out after 30 days, rather than one. We've seen Epic exports take a couple weeks.
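
A minimal sketch of the retry policy described above; the backoff schedule here is invented for illustration, and only the status-code list and the capping rule come from this PR:

```python
RETRYABLE_CODES = {408, 429, 500, 502, 503, 504}

def should_retry(status_code: int) -> bool:
    """Only retry the specific codes listed above, not every 5xx."""
    return status_code in RETRYABLE_CODES

def next_delay(attempt: int, retry_after: float | None) -> float:
    """Hypothetical: pick a sleep time (in seconds) before retry number `attempt`."""
    backoff = 60 * (2 ** attempt)  # made-up exponential schedule
    if retry_after is not None:
        # Honor the server's Retry-After only when it is shorter than our own backoff,
        # so callers still get a guaranteed upper bound on the total delay.
        return min(retry_after, backoff)
    return backoff
```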

Sample Output

Inlining… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 9:32:47
                      Attachments  Resources
 Total examined       168,415      138,501
 Newly inlined        148,074      121,427
 Fatal errors          20,272       20,272
 Retried but gave up       69           69

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added


github-actions bot commented Jan 16, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| ----- | ------- | -------- | --------- | ------ |
| 3877  | 3831    | 99%      | 98%       | 🟢     |

New Files

| File | Coverage | Status |
| ---- | -------- | ------ |
| cumulus_etl/inliner/__init__.py | 100% | 🟢 |
| cumulus_etl/inliner/cli.py | 100% | 🟢 |
| cumulus_etl/inliner/inliner.py | 100% | 🟢 |
| cumulus_etl/inliner/reader.py | 100% | 🟢 |
| cumulus_etl/inliner/writer.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

Modified Files

| File | Coverage | Status |
| ---- | -------- | ------ |
| cumulus_etl/__init__.py | 100% | 🟢 |
| cumulus_etl/cli.py | 100% | 🟢 |
| cumulus_etl/cli_utils.py | 100% | 🟢 |
| cumulus_etl/common.py | 100% | 🟢 |
| cumulus_etl/errors.py | 100% | 🟢 |
| cumulus_etl/etl/cli.py | 100% | 🟢 |
| cumulus_etl/etl/tasks/task_factory.py | 100% | 🟢 |
| cumulus_etl/export/cli.py | 100% | 🟢 |
| cumulus_etl/fhir/__init__.py | 100% | 🟢 |
| cumulus_etl/fhir/fhir_client.py | 100% | 🟢 |
| cumulus_etl/fhir/fhir_utils.py | 100% | 🟢 |
| cumulus_etl/loaders/fhir/bulk_export.py | 100% | 🟢 |
| cumulus_etl/loaders/fhir/ndjson_loader.py | 100% | 🟢 |
| cumulus_etl/store.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

updated for commit: da3d4c1 by action🐍

@mikix mikix force-pushed the mikix/inline branch 4 times, most recently from 2882da0 to cce70c5 on January 22, 2025 at 13:12
@@ -86,7 +90,15 @@ async def __aexit__(self, exc_type, exc_value, traceback):
await self._session.aclose()

async def request(
Contributor Author

@mikix mikix Jan 22, 2025


Let me lightly explain some of the refactoring here.

Before, FhirClient.request was a single request. The retry logic was in the bulk export code, which retried on both error cases and the expected 202 status code for "still working on it".

Now, I've separated the bulk export retry logic into error-path and 202-path, and moved the error-path into this FhirClient.request method. So now any code making requests to the server has easy retries (enabled by default for any request).
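
In spirit, the error-path retry now wraps the low-level request itself. A loose sketch, assuming an httpx-style async client; the names and constants here are illustrative, not the real FhirClient code:

```python
import asyncio

import httpx

RETRYABLE_CODES = {408, 429, 500, 502, 503, 504}

async def request_with_retries(
    session: httpx.AsyncClient, method: str, url: str, *, max_retries: int = 3, **kwargs
) -> httpx.Response:
    """Hypothetical: every caller gets error-path retries for free, in the spirit of FhirClient.request."""
    response = await session.request(method, url, **kwargs)
    for attempt in range(max_retries):
        if response.status_code not in RETRYABLE_CODES:
            return response
        await asyncio.sleep(60 * (2 ** attempt))  # made-up backoff; the real delays differ
        response = await session.request(method, url, **kwargs)
    return response  # give up and hand the last response back to the caller
```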

@mikix mikix force-pushed the mikix/inline branch 3 times, most recently from 5abc924 to 01b515b on January 27, 2025 at 17:58
sed -i 's/"generated_on": "[^"]*", //g' $OUTDIR/*.ndjson
sed -i 's/"generated_on":"[^"]*",//g' $OUTDIR/*.ndjson
Contributor Author


Adjusted to remove the expected whitespace, because this PR also switches the ETL to a more condensed NDJSON output mode that strips whitespace (in an attempt to reduce bloat a little when inlining).
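
For reference, the whitespace difference those two sed lines account for looks like this with plain json.dumps (just an illustration of compact vs. default separators, not necessarily the ETL's actual writer):

```python
import json

resource = {"resourceType": "Patient", "id": "example"}

json.dumps(resource)                         # '{"resourceType": "Patient", "id": "example"}'
json.dumps(resource, separators=(",", ":"))  # '{"resourceType":"Patient","id":"example"}'
```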

Comment on lines +136 to +146
# Look at twice the allowed connections - downloads will block, but that's OK.
# This will allow us to download some attachments while other workers are sleeping
# because they are waiting to retry due to an HTTP error.
peek_at=fhir.FhirClient.MAX_CONNECTIONS * 2,
Contributor Author


This is lightly hostile, because if a server is actually overloaded (vs the normal random flakiness we tend to see), we might be hitting it a little more than we ought to... I'm testing this a little right now to see if removing this bit changes anything on a sample pull against Cerner. But I'm inclined to try running with this for performance reasons (under the assumption that we rarely get an authentic overloaded error and random flakiness is more likely).

Contributor


should this be configurable?

Contributor Author


Just tested this - if we only peek at MAX_CONNECTIONS at a time, we're roughly 10% slower (took 11h instead of 10h for a big inline job).

Contributor Author


> should this be configurable?

The times-two approach feels too nitty gritty to configure. It's just a way to keep some balls spinning while others are blocked on the timeouts. I feel like this number would scale with the length of the timeouts, not server performance. So if we wanted to configure this, you'd probably also want to configure the timeouts. I dunno, too low level in my head.
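
The idea, roughly: hold a connection slot only while actually talking to the server, and keep more work items in flight than slots, so retry sleeps don't idle the pool. A toy sketch (all names, numbers, and the fake request are illustrative):

```python
import asyncio
import random

MAX_CONNECTIONS = 5
connections = asyncio.Semaphore(MAX_CONNECTIONS)

async def download_with_retries(url: str) -> bytes:
    for attempt in range(3):
        async with connections:  # only MAX_CONNECTIONS requests run at once
            ok = random.random() > 0.2  # stand-in for an actual HTTP request
        if ok:
            return b"attachment bytes"
        await asyncio.sleep(2 ** attempt)  # retry sleep happens *without* holding a slot
    raise RuntimeError(f"gave up on {url}")

async def main() -> None:
    # Peek at twice the allowed connections: while some of these tasks are sleeping
    # before a retry, the others can still use the free connection slots.
    urls = [f"Binary/{i}" for i in range(MAX_CONNECTIONS * 2)]
    await asyncio.gather(*(download_with_retries(url) for url in urls), return_exceptions=True)

asyncio.run(main())
```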

@mikix mikix marked this pull request as ready for review January 27, 2025 18:09
"pyarrow < 19",
"pyarrow < 20",
Contributor Author


Unrelated, I just stole this from a dependabot PR that was consistently failing CI for no good reason.

Contributor

@dogversioning dogversioning left a comment


Mostly skimmed the tests, otherwise looks good

@@ -28,6 +30,9 @@ class FhirClient:
See https://hl7.org/fhir/smart-app-launch/backend-services.html for details.
"""

# Limit the number of connections open at once, because EHRs tend to be very busy.
MAX_CONNECTIONS = 5
Contributor


do we want this to be configurable?

Contributor Author


Possibly? This was set by me way back when after doing some small testing on our Cerner staging instance. I didn't really re-examine this default value this time around (i.e. didn't test different numbers during inlining). This current value is not scientific as is, and definitely could change based on the server size / load.

But also... I dunno, this inliner is the first place it'll super matter (other network request paths aren't fully parallel yet). And it's such a long job that tweaking a config option like this to improve performance by 10% or something feels like a lot to ask of a casual user.

But I'm not opposed to allowing customization. The above is just an explanation of why I haven't done it yet. I might want to wait until we notice a problem with the default, or a problem with the speed of inlining jobs that we think we can push, just to delay adding complexity.

When we get access to BCH Epic, I'd like to do some speed tests there with different numbers too.



@dataclasses.dataclass
class Stats:
Contributor


suggestion: InliningStats?

Contributor Author


Sure 👍


_TIMEOUT_THRESHOLD = 60 * 60 * 24 # a day, which is probably an overly generous timeout
_TIMEOUT_THRESHOLD = 60 * 60 * 24 * 30 # thirty days (we've seen some multi-week Epic waits)
Contributor


the saddest revision of all

Comment on lines +11 to +12
Note that this is not atomic - partial writes will make it to the target file.
And queued writes may not make it to the target file at all, if interrupted.
Contributor


this makes me nervous - i think it's probably right to push out this complexity til later, but this feels like something that might cause problems later

Contributor Author


Fair point. In this context, I think it's safe. We write the inlined files to a temp location, then atomically swap out the old one when done, assuming there were no errors.
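
For context, the write-then-swap pattern being described is roughly this (a hypothetical sketch, not the inliner's actual writer):

```python
import os
import tempfile

def rewrite_ndjson(path: str, transform) -> None:
    """Hypothetical: write the inlined lines next to the original file, then atomically swap."""
    fd, temp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf8") as temp_file, open(path, encoding="utf8") as original:
            for line in original:
                temp_file.write(transform(line))
        # Atomic swap: readers see either the old file or the new one, never a partial write.
        os.replace(temp_path, path)
    except BaseException:
        os.unlink(temp_path)
        raise
```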

@mikix mikix merged commit 9bba38d into main Feb 4, 2025
3 checks passed
@mikix mikix deleted the mikix/inline branch February 4, 2025 15:31